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Extracting useful information from high-dimensional data is an 
important focus of today's statistical research and practice. Penal- 
ized loss function minimization has been shown to be effective for 
this task both theoretically and empirically. With the virtues of both 
regularization and sparsity, the Li-penalized squared error minimiza- 
tion method Lasso has been popular in regression models and beyond. 

In this paper, we combine different norms including L\ to form 
an intelligent penalty in order to add side information to the fitting 
of a regression or classification model to obtain reasonable estimates. 
Specifically, we introduce the Composite Absolute Penalties (CAP) 
family, which allows given grouping and hierarchical relationships 
between the predictors to be expressed. CAP penalties are built by 
defining groups and combining the properties of norm penalties at 
the across-group and within-group levels. Grouped selection occurs 
for nonoverlapping groups. Hierarchical variable selection is reached 
by defining groups with particular overlapping patterns. We propose 
using the BLASSO and cross-validation to compute CAP estimates 
in general. For a subfamily of CAP estimates involving only the L\ 
and Loo norms, we introduce the iCAP algorithm to trace the en- 
tire regularization path for the grouped selection problem. Within 
this subfamily, unbiased estimates of the degrees of freedom (df) are 
derived so that the regularization parameter is selected without cross- 
validation. CAP is shown to improve on the predictive performance of 
the LASSO in a series of simulated experiments, including cases with 
p^>n and possibly mis-specified groupings. When the complexity of 
a model is properly calculated, iCAP is seen to be parsimonious in 
the experiments. 
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1. Introduction. Information technology advances are bringing the pos- 
sibility of new and exciting discoveries in various scientific fields. At the same 
time, they pose challenges for the practice of statistics because current data 
sets often contain a large number of variables compared to the number of 
observations. A paramount example is micro-array data, where thousands 
or more gene expressions are collected while the number of samples remains 
around a few hundreds (e.g., [11]). 

For regression and classification, parameter estimates are often defined as 
the minimizer of an empirical loss function L and too responsive to noise 
when the number of observations n is small with respect to the model di- 
mensionality p. Regularization procedures impose constraints represented 
by a penalization function T. The regularized estimates are given by: 

/3(A)=argmin[L(Z,/?)+A-T(/?)], 

P 

where A controls the amount of regularization and Z = (X, Y) is the (ran- 
dom) observed data: X is the n x p design matrix of predictors, and Y a 
n-dimensional vector of response variables (Y being continuous for regres- 
sion and discrete for classification). We restrict attention to linear models: 

(1) L(Z,p)=L^,J2p j X j y j, 

where Xj denotes the observed values for the jth predictor (jth column of 
X). Thus, setting 0j = corresponds to excluding Xj from the model. 

Selection of variables is a popular way of performing regularization. It 
stabilizes the parameter estimates while leading to interpretable models. 
Early variable selection methods such as Akaike's AIC [1], Mallow's C p [15] 
and Schwartz's BIC [19] are based on penalizing the dimensionality of the 
model. Penalizing estimates by their Euclidean norm (ridge regression [12]) 
is commonly used among statisticians. Bridge estimates [9] use the L 7 -norm 
on the (3 parameter defined as ||/?|| 7 = (X)?=i l/%l 7 ) 1/ ' 7 as a penalization: 
they were first considered as a unifying framework in which to understand 
ridge regression and variable selection (the dimensionality of the model being 
interpreted as the "Lo- norm " of the coefficients) . More recently, the nonneg- 
ative garrote [3], wavelet shrinkage [5], basis pursuit [4] and the LASSO [22] 
have exploited the convexity [2] of the Li-norm as a more computationally 
tractable means for selecting variables. 

For severely ill-posed estimation problems, sparsity alone may not be suf- 
ficient to obtain stable estimates [26] . Group or hierarchical information can 
be a source of further regularization constraints. Sources of such information 
vary according to the problem at hand. A grouping of the predictors may 
arise naturally: for categorical variables, the dummy variables used to rep- 
resent different levels define a natural grouping [23] . Alternatively, a natural 
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hierarchy may exist: an interaction term is usually only included in a model 
after its corresponding main effects. More broadly, in applied work, expert 
knowledge is a potential source of grouping and hierarchical information. Af- 
ter completing this work, we have also learned that grouping of parameters 
also occur naturally when fitting various regressions simultaneously [16]. 

In this paper, we introduce the Composite Absolute Penalties (CAP) fam- 
ily of penalties. CAP penalties are highly customizable and build upon L'y 
penalties to express both grouped and hierarchical selection. The overlapping 
patterns of the groups and the norms applied to the groups of coefficients 
used to build a CAP penalty determine the properties of the associated esti- 
mate. The CAP penalty formation assumes a known grouping or hierarchical 
structure on the predictors. For group selection, clustering techniques can 
be used to define groups of predictors before the CAP penalty is applied. In 
our simulations, this two-step approach has resulted in CAP estimates that 
were robust to misspecified groups. 

Zou and Hastie [26] (the Elastic Net), Kim, Kim and Kim [14] (Blockwise 
Sparse Regression) and Yuan and Lin [23] (GLASSO) have previously ex- 
plored combinations of the Li-norm and L2-norm penalties to achieve more 
structured estimates. The CAP family extends these ideas in two directions: 
first, it allows different norms to be combined and; second, different over- 
lapping of the groups are allowed to be used. These extensions result both 
in computational and modeling gains. By allowing norms other than L\ and 
L2 to be used, the CAP family allows computationally convenient penalties 
to be constructed from the L\ and Loo norms. By letting the groups over- 
lap, CAP penalties can be constructed to represent a hierarchy among the 
predictors. In Section 2.2.2, we detail how the groups should overlap for a 
given hierarchy to be represented. 

CAP penalties built from the L\ and Loo-norms are computationally 
convenient, as their regularization paths are piecewise linear for piecewise 
quadratic loss functions [18]. We call such group of penalties the iCAP family 
("i" standing for the infinity norm). We extend the homotopy/LARS-LASSO 
algorithm [8, 17] and design fast algorithms for iCAP penalties in the cases of 
nonoverlapping group selection (the iCAP algorithm) and tree-hierarchical 
selection (the hierarchical iCAP algorithm: hiCAP). A Matlab implemen- 
tation of these algorithms is available from: http://www.stat.berkeley.edu/ 
twiki /Research / YuGroup / Software . 

For iCAP penalties, used with the L2-I0SS, unbiased estimates of the de- 
grees of freedom of models along the path can be obtained by extending 
the results in Zou, Hastie and Tibshirani [27]. It is then possible to em- 
ploy information theory criteria to pick an estimate from the regularization 
path and thus avoiding the use of cross validation. Models picked from the 
iCAP path using Sugiura's [21] AIC C and the degrees of freedom estimates 
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had predictive performance comparable to cross- validated models even when 
n -C p. 

The computational advantage of CAP penalties is preserved in a broader 
setting. We prove that CAP penalties is convex whenever all norms used in 
its construction are convex. Based on this, we propose using the BLASSO 
algorithm [24] to compute the CAP regularization path and cross-validation 
to select the amount of regularization. 

Our experimental results show that the inclusion of group and hierar- 
chical information substantially enhance the predictive performance of the 
penalized estimates in comparison to the LASSO. This improvement was 
preserved even when empirically determined partitions of the set of pre- 
dictors was severely mis-specified and was observed for different settings of 
the norms used to build CAP. While the CAP estimates are not sparser 
than LASSO estimates in the number of variables sense, they result in more 
parsimonious use of degrees of freedom and more stable estimates [7] . 

The remainder of this paper is organized as follows. In Section 2, for a 
given grouping or hierarchical structure, we define CAP penalties, relate 
them to the properties of 7-norm penalized estimates and detail how to 
build CAP penalties from given group and hierarchical information. Section 

3 proves the convexity of the CAP penalties and describes the computation 
of CAP estimates. We propose algorithms for tracing the CAP regulariza- 
tion path and methods for selecting the regularization parameter A. Section 

4 gives experimental results based on simulations of CAP regression with 
the L2-I0SS and explores a data-driven group formation procedure showing 
that CAP estimates enjoy some robustness relative to possibly mis-specified 
groupings. Section 5 concludes the paper with a brief discussion. 

2. The Composite Absolute Penalty (CAP) family. We first give a re- 
view of the properties of L 7 -norm penalized estimates. Then we show how 
CAP penalties exploit them to reach grouped and hierarchical selection. For 
overlapping groups, our focus will be on how to overlap groups so hierarchi- 
cal selection is achieved. 

2.1. Preliminaries: properties of bridge regressions. We consider an ex- 
tended version of the bridge regression [9] where a general loss function 
replaces the L2-I0SS. The bridge regularized coefficients are given by 

(2) /3 7 (A) = argmin[L(Z,/3) + A-T(/?)] 

P 

with T(/?) HI/311} = X>;P- 

3=1 

The properties of bridge estimates path vary considerably according to the 
value chosen for 7. The estimates tend to fall in regions of high "curvature" 



GROUPED AND HIERARCHICAL SELECTION 
7 = 1.1 7 = 2 7 = 4 











r 








I 


J 









Fig. 1. Regularization paths of bridge regressions. Upper Panel: Regularization paths 
for different bridge parameters for the diabetes data. From left to right: Lasso (j — 1), 
near-Lasso (j = 1.1), Ridge (>y = 2), over-Ridge fy = A), max(7 = oo y ). The horizontal 
and vertical axis contain the 'y-norm of the normalized coefficients and the normalized 
coefficients respectively. Lower Panel: Contour plots [| ^2) ||-y = 1 for the correspona 
penalties. 



of the penalty contour plot: for < 7 < 1, some estimated coefficients are 
set to zero; for 1 < 7 < 2, estimated coefficients lying in directions closer to 
the axis are favored; for 7 = 2, the estimates are not encouraged to lie in 
any particular direction; and finally for 2 < 7 < 00, they tend to concentrate 
along the diagonals. Figure 1 illustrates the different behavior of bridge 
regression estimates for different values of 7 for the diabetes data used in 
Efron et al. [8]. 

CAP penalties exploit the distinct behaviors of the bridge estimates ac- 
cording to whether 7 = 1 or 7 > 1. For convex L and 7 > 1, the bridge 
estimates are fully characterized by the Karush-Kuhn-Tucker (KKT) con- 
ditions (see [2]): 

(3) |^ = -A^& = -A • sign(/3 J )^^- for j such that f3 3 + 0; 



9/3,- dj3. 



3 w h>3 ||P||7 



(4) 



dL 


< A 




df3 j 




dp, 



A — 3 , for j such that /5L = 0. 
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Hence, for f < 7 < 00, the estimate (3j equals zero if and only if \p j=0 
0. This condition is satisfied with probability zero when the distribution of 
Zi = (Xi,Y{) is continuous and L is strictly convex. Therefore, 1 < 7 < 00 
implies that all variables are almost surely included in the bridge estimate. 
When 7 = f , however, the right-hand side of (4) becomes a constant set by 
A and thus variables that fail to infinitesimally reduce the loss by a certain 
threshold are kept at zero. In what follows, we show how these distinctive 
behaviors result in group and hierarchical selections. 
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2.2. CAP penalties. We start this subsection by defining the CAP family 
in its most general form. Then for a given group or hierarchical structure, 
we specialize the CAP penalty for grouped and hierarchical selection. Un- 
less otherwise stated, we assume that each predictor Xj in what follows is 
normalized to have mean zero and variance one. 

Let 

l={l,...,p} 

contain all indices of the predictors. Given K subsets of indices, 

GkCl. 

The group formation varies according to the given grouping or hierarchical 
structure that we want to express through the CAP penalty. Details are 
presented later in this section. We let a given grouping be denoted by 

G = (Gi, ■ ■ -,Gk)- 

Moreover, a vector of norm parameters 7 = (70, 71, • • • , ^k) & must 
be defined. We let 7^ = c denote the case 7^ = c, V7c > 1. 

Call the L 7o -norm the overall norm and L 7fe -norm the kth group norm 
and define 

A? fe = (Pj)jeg k , 
(5) N k = \\/3g k \\y k and 

N = (N 1 ,...,N K ) for k = l,...,K. 
The CAP penalty for grouping G and norms 7 is given by 



(6) Tg„(j3) = \\N\m 



k 



The corresponding CAP estimate for the regularization parameter A is 
(7) $g„(\) = argmin[L(Z,/5) + A • Tg 

In its full generality, the CAP penalties defined above can be used to 
induce a wide array of different structures in the coefficients: 70 determines 
how groups relate to one another while 7^ dictates the relationship of the 
coefficients within group k. The general principle follows from the distinctive 
behavior of bridge estimates for 7 > 1 and 7 = 1 as discussed above. Hence, 
for 70 = 1 and 7& > 1 for all k, the variables in each group are selected 
as a block [14, 23]. Nonoverlapping groups yield grouped selection, while 
suitably constructed overlapping patterns can be used to achieve hierarchical 
selection. More general overlapping patterns and norm choices are possible, 
but we defer their study for future research as they are not needed for our 
goal of grouped and hierarchical selection. 
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Fig. 2. Effect of group-norm on regularization path. In this figure, we show the regulariza- 
tion path for CAP penalties with different group norms applied to the diabetes data in Efron 
et al. [8J. The predictors were split into three groups: the first group contains age and sex; 
the second, body mass index and blood pressure; and the third, blood serum measurements. 
From left to right, we see: (a) Lasso (70 = 1,7k = 1); (b) CAP(l.l), (70 = l,7fc = 1.1); (c) 
GLasso (70 = 1,7k = 2), (d) CAP(A), (70 = l, 7 fe = 4); (e) iCAP{ 10 = l, 7fe = 00). 

2.2.1. Grouped selection: the nonoverlapping groups case. When the goal 
of the penalization is to select or exclude nonoverlapping groups of variables 
simultaneously and the groups are known, we form nonoverlapping groups 
Qk, k = 1, . . . , K to reflect this information. That is, all variables to be added 
or deleted concurrently should be collected in one group Qf. € Q. 

Given the grouping G, the CAP penalization can be interpreted as mimick- 
ing the behavior of bridge penalties on two different levels: an across-group 
and a within-group level. On the across-group level, the group norms 
behave as if they were penalized by a L 7o -norm. On the within-group level, 
the 7^ norm then determines how the coefficients (5g k relate to each other. A 
formal result establishing this is established in a Bayesian interpretation of 
CAP penalties for nonoverlapping groups presented in details in the techni- 
cal report version of this paper [25]. For 70 = 1, sparsity in the N vector of 
group norms is promoted. The 7^ > 1 parameters then determine how close 
together the size of the coefficients within a selected group are kept. Thus, 
Yuan and Lin's [23] corresponds to the LASSO on the across-group level and 
the rotational invariant ridge penalization on the within-group level. Figure 
2 illustrates this fact for the diabetes data from Efron et al. [8]. 

By allowing group norms other than L2 to be applied to the coefficients 
in a group, CAP can lead to computational savings by setting 70 = 1 and 
7^ = 00. In Section 3 below, we present computationally efficient algorithm 
and model selection criterion for such CAP penalties. 

In Section 4, simulation experiments provide compelling evidence that the 
addition of the group structure can greatly enhance the predictive perfor- 
mance of an estimated model. 

A note on normalization. Assuming the predictors are normalized, it 
may be desirable to account for the size of different groups when build- 
ing the penalties. For 70 = 1, 7fe = 7 and letting 7* = =31", the decision on 
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whether group Qk is in the model for a fixed A can be shown to depend upon 
\\V/3 g L(Z,(3(X))\\j*. Thus, larger groups are more likely to be included in 
the model purely due to their size. We propose that group normalization 

1 /'Y* 

is achieved by dividing the variance normalized predictors in Qk by q k 
Following Yuan and Lin [23], such correction causes two hypothetical groups 
Qk and Qy having WV^L^Z, /3(A)) || = c for all j € Qk ^Qw to be included 
in the CAP estimate simultaneously. Note that for 7 = 1 (LASSO), 7* = 00 
and the group sizes are ignored as in this setting the group structure is lost. 

In an extended technical report version of this paper [25], we perform 
experiments suggesting that the additional normalization by group size does 
not affect the selection results greatly. This aspect of grouped CAP penalties 
is explained in further detail there. 

2.2.2. Hierarchical selection: the nested groups case. We now show how 
to define CAP penalties to achieve hierarchical selection. We start from an 
example used to illustrate the principle behind hierarchical CAP penalties, 
and then prove a result concerning CAP penalties with overlapping groups. 
Finally, we show how to use our result to build a penalty that induces a 
hierarchy starting from its representation as a directed graph. 

Consider a simple case involving two predictors X\ and X2. Suppose we 
want X\ to be included in the model before X2. A directed graph can be 
used to represent this hierarchy as shown in panel (a) in Figure 4. This 
hierarchy can be induced by defining the overlapping groups Q\ = {1, 2} and 
Q2 = {2} with 70 = 1, 7 m > 1 for m = 1,2. That results in the penalty 

(8) T(/3) = ||(/3 1 ,/3 2 )|| 71 + ||(/3 2 )|| 72 . 

The contour plots of this penalty function are shown in Figure 3 for 
different values of 7 = 71. As Q2 contains only one variable, these contours 
are the same regardless of the value chosen for 72 ( 1 1 1 1 -y 2 = I f° r an y 
72 ) . The breakpoints along the 02 = axis in panels (b) through (d) show 
that solutions with Pi ^ and 02 = tend to be encouraged by this penalty 
when 71 > 1. Setting 71 = 1, however, causes breakpoints to appear along 
the Pi = axis as shown in panel (a) , hinting that 71 > 1 is needed for the 
hierarchical structure to be preserved. 

In what refers to the definition of the groups, two things were important 
for the hierarchy to arise from penalty (8): first, 02 was in every group Pi 
was; second, there was one group in which /3 2 was penalized without Pi 
being penalized. As we will see below, having P2 in every group where Pi 
is ensures that, once P2 deviates from zero, the infinitesimal penalty of Pi 
becomes zero. In addition, letting P2 be on a group of its own makes it 
possible for Pi to deviate from zero, while P2 is kept at zero. 
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The general principle behind the example. The example above suggests 
that, to construct more general hierarchies, the key is to set 70 = 1, 7^ > 1 
for all k. Given such a 7, a penalty can cause a set of indices X\ to be added 
before the set Z2, by defining groups Gi = X2 and Gi =1\ UZ2- Our next 
result extends this simple case to more interesting hierarchical structures. 

Theorem 1. Let Ti,l2 C {1, . . . ,p} be two subsets of indices. Suppose: 

• 70 = 1 and 7^ > 1, V& = 1, . . . ,K . 

• I\ C Qk =>■ 2^2 C Qk for all k and 

• 3k* such that I2 C Qk* an d l-i Gk* ■ 

Then, ojj^-T({3) = whenever f3x 2 / and {3ji = 0. 

A proof is given in the Appendix A. Assuming the set {Z G M nx (P +1 ) : 
■gj^rL(Z, 0)\p x = p T p x _q = 0} to have zero probability, Theorem 1 states 

that once the variables in Z2 are added to the model, infinitesimal movements 
of the coefficients of variables in T\ are not penalized and hence /3j 1 will 
almost surely deviate from zero. 

Defining groups for hierarchical selection. Using Theorem 1, a group- 
ing for a more complex hierarchical structure can be constructed from its 
representation as a directed graph. Let each node correspond to a group of 
variables Gk and set its descendants to be the groups that should only be 
added to the model after Qk- The graph representing the hierarchy in the 
simple case with two predictors above is shown in the panel (a) of Figure 4. 
Based on this representation and Theorem 1, CAP penalties enforcing the 
given hierarchy can be obtained by setting 

nodes 

(9) T(/3) = ^ a m - \\{Pg m >Pan descendants of g m )ll7m 

m=l 




JO = 1.71 - 1 

(a) 



(b) 
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Fig. 3. Contour plots for the penalty in (8). 
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with a m > for all m. The factor a m can be used to correct for the effect 
of a coefficient being present in too many groups, a concern brought to our 
attention by one of the referees. In this paper, we keep a m = 1 for all m 
throughout. We return to this issue in our experimental section below. 

For a more concrete example, consider a regression model involving d 
predictors x\, . . . ,Xd and all its second-order interactions. Suppose that an 
interaction term to be added only after the corresponding main effects 

Xi and Xj . The hierarchy graph is formed by adding an arrow from each main 
effect to each of its interaction terms. Figure 4 shows the hierarchy graph 
for d = 4. Figure 5 shows sample paths for d = 4 with penalties based on (9) 
and using this hierarchy graph and having different settings for 7^. These 
sample paths were obtained by setting 0\ = 20, 02 = 10, 0s = 5, 0\ t 2 = 15 
and 02,3 = 7. All remaining coefficients in the model are set to zero. Each 
predictor has standard normal distribution and the signal to noise ratio is 
set to 2. Because of the large effect of some interaction terms they are added 
to the model before their respective main effects when the LASSO is used. 
However, setting 7^ to be slightly larger than 1 is already enough to cause 
the hierarchical structure to be satisfied. We develop this example further 
in one of the simulation studies in Section 4. 

3. Computing CAP estimates. The proper value of the regularization 
parameter A to use with CAP penalties is rarely known in advance. Two 
ingredients are then needed to implement CAP in practice: efficient ways of 
computing estimates for different values of A and a criterion for choosing an 
appropriate A. This section proposes methods for completing these tasks. 

3.1. Tracing the CAP regularization path. Convexity is a key property 
for solving optimization problems such as the one defining CAP estimates 




Fig. 4. Directed graphs representing hierarchies: (a) The hierarchy of the simple example: 
X 1 must precede X2; (b) Hierarchy for a main and interaction effects model with four 
variables. The "root nodes" correspond to the main effects and must be added to the model 
before its children. Each main effect has all the interactions in which it takes part as its 
children. Each second order interaction effect has two parents. 
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Fig. 5. A sample regularization path for the simple AN OVA example with four variables. 

In the LASSO path, an interaction (dotted lines) is allowed to enter the model when its 
corresponding main effect (solid lines) is not in the model. When the group norm 7^ is 
greater than one, the hierarchy is respected. From left to right: (a) Lasso (70 = 1,7* = 1); 
(b) CAP(l.l), (70 = l j7Jt =4), (c) GLasso (70 = 1,7* = 2), (d) CAP{A), (70 = 1,^=4), 
(e) iCAP, (70 = l,7fc = 00). 



(7). When the objective function is convex, a point satisfying the Karush- 
Kuhn-Tucker (KKT) conditions is necessarily a global minimum (see [2]). 
As the algorithms we present below rely on tracing solutions to the KKT 
conditions for different values of A, we now present sufficient conditions for 
convexity of the CAP program (7). A proof is given in Appendix A. 

Theorem 2. If ^ >l,Vi = 0,...,K, then T(j3) in (6) is convex. If, in 
addition, the loss function L is convex in (3 the objective function of the 
CAP optimization problem in (7) is convex. 

We now detail algorithms for computing the CAP regularization path. 
The BLasso algorithm is used to deal with general convex loss functions and 
CAP penalties. Under the L2-I0SS with 70 = 1 and 7^ = 00, we introduce the 
iCAP (oo-CAP) and the hiCAP (hierarchical-oo-CAP) algorithms to trace 
the path for group and tree-structured hierarchical selection, respectively. 

3.1.1. The BLasso algorithm. The BLasso algorithm [24] can be used 
to approximate the regularization path for general convex loss and penalty 
functions. We use the BLasso algorithm in our experimental section due 
to its ease of implementation and flexibility: the same code was used for 
different settings of the CAP penalty. 

Similarly to boosting [10] and the Forward Stagewise Fitting algorithm 
[8], the BLasso algorithm works by taking forward steps of fixed size in 
the direction of steepest descent of the loss function. However, BLasso also 
allows for backward steps that take the penalty into account. With the 
addition of these backward steps, the BLasso is proven to approximate the 
Lasso path arbitrarily close, provided the step size can get small. An added 
advantage of the algorithm is its ability to trade off between precision and 
computational expense by adjusting the step size. For a detailed description 
of the algorithm, we refer the reader to Zhao and Yu [24]. 
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3.1.2. Piecewise linear paths: L2-I0SS, L\-norm and L^-norm penaliza- 
tion. For piecewise quadratic, convex loss functions and 7^ £ {1, 00} for all 
k = 0, . . . , K, the CAP regularization path is known to be piecewise linear 
[18]. In these cases, it is possible to devise algorithms that jump from one 
breakpoint to the next while exactly computing their respective estimates as 
in the homotopy/LARS-LASSO algorithm [8, 17]. Next, we introduce two 
such algorithms for the I/2-loss: the first (iCAP) for grouped selection and 
the second for hierarchical selection (hiCAP). Before that, we present an 
algorithm for the L2-I0SS estimates penalized by the Loo-norm (iLASSO). It 
serves as a stepping stone between the homotopy/LARS-LASSO [8, 17] and 
the iCAP and hiCAP path-tracing algorithms. 



The regularization path for iLASSO (00-LASSO). The iLASSO estimate 
corresponds to the bridge regression (2) with the L2-I0SS and 7 = 00. The 
KKT conditions defining the estimate for a particular A are 

X 1I X X U X 



(10) 



X U X X K X 



X U X X U\ 





a 




x n x Y ~ A 








. X Ux Y . 



Pn x = 



where TZ X = {j : \(3j\ = [|/3||oo>, U x = {j : < \\P\\oo}, S(X) = signsLY'(Y - 
X(3(\))], and X-jz x = J2j£iz x SjW x j- From these conditions, it follows that 
|4(A)| = d, for all j £ Tlx and X'^Y - X(5{\)) = 0, for all j G U\. Starting 
from a breakpoint Ao and its respective estimate $(\q), the path moves in 
a direction A/3 that preserves the KKT conditions. The next breakpoint 
is then determined by (3\ 1 = (3\ + 5 ■ A/3 where 5 > is the least value to 
cause an index to move between the 1Z\ and U\ sets. The pseudo-code for 
the iLASSO algorithm is presented in the technical report version of this 
paper [25]. We now extend this algorithm to handle the grouped case. 

The iCAP algorithm (oo-CAP). The iCAP algorithm is valid for the 
L2-I0SS and nonoverlapping groups with 70 = 1 and 7^ = 00. This algorithm 
operates on two levels: it behaves as the Lasso on the group level and as 
the iLASSO within each group. To make this precise, first define the kth 
group correlation at A to be Cfc(A) = \\X'g (Y— X/3(A))||i and the set of active 
groups A\ = {j E {1, . . . ,K} : \cj(X)\ = max k =i y ..,K |cfe(A)|}. At a given A, the 
groups not in A\ have all their coefficients set to zero and /3(A) is such that all 
groups in A\ have the same group correlation size. At the within-group level, 
for /3(A) to be a solution, each index j £ Qk must belong to either of two sets: 
Ux, k = {j £ G k : Xj(Y - A/3(A)) = 0} or K x , k = {j £ 9k ■ Pj(\) = life (A) |[oo}- 

For Ao = maxj g00v .. j ^ Y"||i, the solution is given by /3(Ao) = 0. From 
this point, a direction A/3 can be found so that the conditions above are 
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satisfied by /3(Ao) + 5A(3 for small enough 5 > 0. To find the next break- 
point, compute the least value of 5 > that causes one of the following 
events: a group is added to or removed from A\; an index moves between 
the sets U\ : k and TZ\ t k for some group k; or a sign change occurs in the 
correlation between the residuals and a variable in an inactive group. If 
no such 5 > exists, the algorithm moves toward an un-regularized so- 
lution along the direction A/3. The pseudo-code is given in Appendix B. 
The Matlab code implementing this algorithm can be downloaded from 
http : //www. stat .berkeley.edu/twiki/Research/YuGroup/Software. 

The hiCAP algorithm (hierarchical-oo-CAP). We now introduce an al- 
gorithm for hierarchical selection. It is valid for the L2-I0SS when 70 = 1, 
7/j = 00 and the graph representing the hierarchy is a tree. The KKT con- 
ditions in this case are to an extent similar to those of the nonoverlapping 
groups case. The difference is that the groups now change dynamically along 
the path. Specifics of the algorithm are lengthy to describe and do not pro- 
vide much statistical insight. We give a high level description here and refer 
readers interested in implementing the algorithm to the code available at 
http : //www. stat .berkeley.edu/twiki/Research/YuGroup/Software. 

The algorithm starts by forming nonoverlapping groups such that: 

• Each nonoverlapping group consists of a sub-tree. 

• Viewing each of the subtrees formed as a supernode, the derived super- 
tree formed by these supernodes must satisfy the condition that average 
correlation size (unsigned) between Y and A"s within the supernode is 
higher than that of all its descendant supernodes. 

Once these groups are formed, the algorithm starts by moving the coef- 
ficients in the root group as in the nonoverlapping iCAP algorithm. The 
optimality conditions are met because the root group has the highest aver- 
age correlation. Then, the algorithm proceeds observing two constraints: 

1. Average unsigned correlation between Y — X/3(\) and X's within each 
supernode is at least as high than that of all its descendant supernodes. 

2. Maximum unsigned coefficient of each supernode is larger than or equal 
to that of any of its descendants. 

Between breakpoints, the path is found by determining a direction such 
that these conditions are met. Breakpoints are found by noticing they are 
characterized by: 

• If the average correlation between Y — X (3{^X) and a subtree Q a contained 
by a supernode equals that of a supernode, then Q a splits into a new 
supernode. 
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• If a supernode a's maximum coefficient size equals that of a descendant 
supernode b, then they are combined into a new supernode. These would 
also guarantee that a super node with all zero coefficients should have 
descendants with all zero coefficients. 

• If a supernode with all zero coefficients and a descendant reached equal 
average correlation size (unsigned), they are merged. 

3.2. Choosing the regularization parameter A. For the selection of the 
regularization parameter A under general CAP penalties, we propose the use 
of cross-validation. We refer the reader to Stone [20] and Efron [6] for details 
on cross-validation. For the particular case of the iCAP having nonoverlap- 
ping groups and under the L2-I0SS, we extend the Zou, Hastie and Tibshirani 
[27] unbiased estimate of the degrees of freedom for the LASSO. This esti- 
mate can then be used in conjunction with an information criterion to pick 
an appropriate level of regularization. We choose to use Sugiura's [21] AICc 
information criterion. Our preference for this criterion follows from its being 
a correction of Akaike's [1] AIC for small samples. Letting k denote the 
dimensionality of a linear model, the AICc criterion for a linear model is 



where n corresponds to the sample size. A model is selected by minimizing 
the expression in (11). To apply this criterion to iCAP, we substitute k by 
estimates of the effective number of parameters (degrees of freedom). As we 
will see in our experimental section, the AICc criterion resulted in good 
predictive performance, even in the "small-n, large-p" setting. 

The extension of Zou, Hastie and Tibshirani [27] unbiased estimates of the 
degrees of freedom to iCAP follows from noticing that, between breakpoints, 
the LASSO, iLASSO and iCAP estimates mimic the behavior of projections 
on linear subspaces spanned by subsets of the predictor variables. It fol- 
lows that standard results for linear estimates establish the dimension of 
the projecting subspace as an unbiased estimate of the degrees of freedom 
of penalized estimates in a broader set of penalties. Therefore, to compute 
unbiased estimates of degrees of freedom for the iLASSO or iCAP regres- 
sions above, it is enough to count the number of free parameters being fit at 
a point in the path. Letting A\ and Uk,\ be as defined in Section 3.1 above, 
the resulting estimates of the degrees of freedom are df(A) = \U\ \ + 1 for the 
iLASSO and df(A) = \A\\ + Y,keA x \ U k,\\ for the iCAP. A complete proof 
for the iLASSO case and a sketch of the proof for the iCAP case are given 
in Appendix A.l. Based on the above reasoning, we believe similar results 
should hold under the L2-I0SS for any CAP penalties built exclusively from 
the Li-norm and the Loo-norm. The missing ingredient is a way of deter- 
mining the dimension of a projecting subspace along the path in broader 
settings. Future research will be devoted to that. 
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4. Experimental results. We now illustrate and evaluate the use of CAP 

in a series of simulated examples. The CAP framework needs the input 
of a grouping or hierarchical structure. For the grouping simulations, we 
explore the possibility of having a data-driven group structure choice and 
show that CAP enjoys a certain degree of robustness to misspecified groups. 
We leave a completely data-driven group structure choice as a topic of future 
research. For the hierarchical selection, two examples illustrate the use of 
the framework presented in Section 2.2.2 and the penalty in (9) to induce 
a pre-determined hierarchy. We defer the study of data-defined hierarchies 
for future work. 

We will be comparing the predictive performance, sparsity and parsimony 
of different CAP estimates and that of the LASSO. As a measure of predic- 
tion performance we use the model error ME{(5) = {(5 — j3)K{X' X){(3 — (3). 
In what concerns sparsity, we look not only at the number of variables se- 
lected by CAP and LASSO, but also at measures of sparsity that take the 
added structure into account: the number of selected groups for group selec- 
tion and a properly defined measure for hierarchical selection (see Section 
4.2). Whenever possible, we also compare the parsimony of the selected 
models as measured by the effective degrees of freedom and compare the 
cross-validation and AICc based selections for the tuning parameter A. 

In all experiments below, the data is simulated from 

(12) Y = X(3 + ae, 

with e ~ A/"(0,I). The parameters (3, a as well as the covariance structure of 
X are set specifically for each experiment. 

4.1. Grouped selection results. In our grouping experiments, the group 
structure among the p predictors in X is due to their relationship to a set 
of K zero- mean Gaussian hidden factors Z G W K : 

(2.0, if \k-k'\ =0, 
cov(Z k ,Z k >) = < 1.0, if \k-k'\ = l, for k,k' € {1,...,K}. 
t 0.0, if \k-k'\ > 1, 

A predictor Xj in group Q k is the sum of the Z k factor and a noise term 

Xj = Z k + r}j ifj'G^/c, 

with the Gaussian r/j noise term having zero mean and 

cov(7? i ,7? i /) = (4.0) •0.95 |j ~ j ' 1 . 

The correlation between the factors Z is meant to make group selection 
slightly more challenging. Also, we chose the covariance structure of the dis- 
turbance terms rj to avoid an overly easy case for the empirical determination 
of the groups described below. 
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Clustering for forming groups. In applications, the grouping structure 
must often be estimated from data. In our grouping experiments, we esti- 
mate the true group structure Q by clustering the predictors in X using the 
Partitioning Around Medoids (PAM) algorithm (see [13]). The PAM algo- 
rithm needs the number of groups as an input. Instead of trying to fine-tune 
the selection of the number of groups, we fix the number of groups K in the 
estimated clustering above, below and at the correct value K, specifically, 
we set: 

• K = K: proper estimation of the number of groups; 

• K = 0.5 • K: severe underestimation of the number of groups; 

• K = 1.5 • K: severe overestimation of the number of groups. 

Implicitly, we are assuming that a tuning method that can set the number 
of estimated groups K in PAM to be between 0.5K (alt. below 1.5K) and 
the true number of groups K has results that are no worse than the ones 
observed in our underestimated (alt. overestimated) scenario below. 

4.1.1. Effect of group norms and A selection methods. In this first ex- 
periment, we want to compare the difference among CAP estimates using 
alternative settings for the within-group norm and the LASSO. We keep the 
dimensionality of the problem low and emulate a high-dimensional setting 
(n < p) by setting 

n = 80, p= 100 and K = 10. 

The coefficients (3 are made dissimilar (see Figure 6) within a group to 
avoid undue advantage to iCAP: 

! 0.10(1 + 0.9-?- 1 ), for j = 1, ... , 10; 

0.04(1 + 0.9 J ~ n ), for j = 11, . . . , 20; 

0.01(1 + 0.9 j ~ 21 ), for j = 21, . . . , 30; 

0, otherwise. 

The noise level is set to a = 3 and results are based on 50 replications. 

The results reported in Table 1 show that all the different CAP penalties 
considered significantly reduced the model error in the comparison with the 
LASSO. The reduction in model error was also observed to be robust to 
misspecifications of the group structure (the cases K = 0.5 • K and K = 
1.5 -K). 

Table 2 shows that our estimate for the degrees of freedom used in con- 
junction with the AICc criterion was able to select predictive models as 
good as 10-fold cross-validation at a lower computational expense. The com- 
parison of the results across the number of clusters used to group the predic- 
tors show that the improvement in prediction was robust to misspecifications 
in the number of groups used to cluster the predictors. iCAP's performance 
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Table 1 

Comparison of different CAP penalties. Results for the LASSO and different CAP 
penalties for the first grouping experiment (based on 50 replications). All CAP penalties 

considered reduced the mean model error and number of selected groups in the 
comparison with the LASSO. The CAP penalty with j k = 4 had a slight advantage over 

iCAP and the GLASSO 



N. groups 


LASSO 


GLASSO 


CAP(4) 


iCAP 






Model errors 






Underestimated 


1.863 


1.025 


0.918 


1.429 


(0.5K) 


(0.194) 


(0.101) 


(0.106) 


(0.316) 


Right 


1.863 


1.048 


0.835 


0.933 


(1.0K) 


(0.194) 


(0.094) 


(0.100) 


(0.092) 


Overestimated 


1.863 


1.159 


0.970 


1.271 


(1.5K) 


(0.194) 


(0.089) 


(0.090) 


(0.135) 




Number of selected variables 






Underestimated 


13.567 


45.233 


39.650 


65.333 


(0.5K) 


(1.243) 


(3.504) 


(3.490) 


(4.633) 


Right 


13.567 


38.200 


32.450 


49.000 


(1.0K) 


(1.243) 


(2.502) 


(1.748) 


(3.567) 


Overestimated 


13.567 


33.900 


33.450 


48.400 


(1.5K) 


(1.243) 


(2.452) 


(3.097) 


(3.461) 




Number of selected groups 






Underestimated 


6.233 


5.600 


4.600 


6.933 


(0.5K) 


(0.491) 


(0.400) 


(0.387) 


(0.442) 


Right 


6.233 


4.067 


3.250 


4.900 


(1.0K) 


(0.491) 


(0.275) 


(0.176) 


(0.357) 


Overestimated 


6.233 


4.100 


3.950 


5.333 


(1.5K) 


(0.491) 


(0.326) 


(0.352) 


(0.402) 






Table 2 






Comparison of model 


errors according to A selection method. Mean diff 


erence in model 


error between CAP and LASSO models selected using the AICc criterion and 10-fold 


cross-validation. The AICc criterion had results comparable 


to 10-fold 


cross-validation 


ME(A/Cc)-ME(CV) 






LASSO ( 7fe = 1) 




iCAP ( 7fc = oo) 


Underestimated 




-0.253 




-0.077 


(0.5K) 




(0.177) 




(0.048) 


Right 




-0.470 




-0.267 


(1.0K) 




(0.388) 




(0.207) 


Overestimated 




-0.324 




-0.112 


(1.5K) 




(0.245) 




(0.065) 
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was the most sensitive to this type of misspecification as the Loo-norm makes 
a heavier use of the prespecified grouping information. 

In terms of sparsity, the CAP estimates include a larger number of vari- 
ables than the LASSO due to its block-inclusion nature. If we look at how 
many of the true groups are selected instead, we see that the CAP estimates 
made use of a lesser number of groups than the LASSO: an advantage if 
group selection is the goal. The low ratio between the number of variables 
and number of groups selected by the LASSO provide evidence that the 
LASSO estimates did not preserve the group structure selecting only a few 
variables from each group. 

4.1.2. Grouping with small-n-large-p. We now compare iCAP and the 
LASSO when the number of predictors p = Kq grows due to an increase 
in either the number of groups K or in the group size q. The sample size 
is fixed at n = 80. The coefficients are randomly selected for each replica- 
tion according to two different schemes: in the Grouped Laplacian scheme 
the coefficients are constant within each group and equal to K independent 
samples from a Laplacian distribution with parameter ac', in the Individ- 
ual Laplacian scheme the p coefficients are independently sampled from a 
Laplacian distribution with parameter a/. The Grouped Laplacian scheme 
favors iCAP due to the grouped structure of the coefficients whereas the 
Individual Laplacian scheme favors the LASSO. The parameters aa and aj 
were adjusted so the signal power E(/3'A'X/3) is roughly constant across 
experiments. The complete set of parameters used is shown in Table 3. 

We only report the results obtained from using the AICc criterion to 
select A. The results for 10-fold cross-validation were similar. Tables 4 and 
5 show the results based on 100 replications. 

Coefficients for grouping experiment 

025 1 1 • 1 1 1 n 




Fig. 6. Profile of coefficients for grouping experiment. In the first grouping experiment, 
only the first three groups have nonzero coefficients. Within each group, the coefficients 
have an exponential decay. 
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The iCAP estimates had predictive performance better than or compa- 
rable to LASSO estimates. For the Grouped Laplacian case, the reduction 
in model error was very pronounced for all settings considered. In the In- 
dividual Laplacian case, iCAP and LASSO had comparable model errors 
for p = 100. As the number of predictors increased, however, iCAP resulted 
in lower model errors than the LASSO even under the Individual Lapla- 
cian regime. This result provides evidence that collecting highly correlated 
predictors into groups is beneficial in terms of predicting performance, espe- 
cially as the ratio p/n increases. The predictive gains from using CAP were 
also preserved under misspecified groupings. iCAP estimates had smaller or 
comparable model errors than the LASSO when the predictors were clus- 
tered into a number of groups that was 50% smaller or larger than the actual 
number of predictor clusters. 

Contrary to what was expected, the iCAP estimates involved a number 
of groups comparable to the LASSO in all simulated scenarios. We believe 
this is due to a more parsimonious use of degrees of freedom by iCAP esti- 
mates. As more groups are added to the model, the within-group restrictions 
imposed by the Loo-norm prevent a rapid increase in the use of degrees of 
freedom. As a result, iCAP estimates can afford to include groups in the 
model more aggressively in an attempt to reduce the L2-I0SS. This reason- 
ing is supported by the lesser degrees of freedom used by iCAP selected 
models in the comparison to the LASSO. 

4.2. Hierarchical selection results. Now, we provide examples where the 
hierarchical structure of the predictors is exploited to enhance the predictive 
performance of models. We define hierarchical gap as a measure of compli- 
ance to the given hierarchical structure: it is the minimal number of ad- 
ditional variables that should be added to the model so the hierarchy is 
satisfied. If a model satisfies a given hierarchy, no additional variables must 
be added and this measure equals zero. 

4.2.1. Hierarchical selection for ANOVA model with interaction terms. 
We now further develop the regression model with interaction terms intro- 
duced in Section 2 (cf. Figure 4). 



Table 3 

Parameters for the simulation of the small-n-large-p case 



p 


Q 


K 


Grouped Laplacian 




Individual Laplacian 




a 


q g E((3'X'Xf3) 


SNR 


a. i 


E(P'X'Xf3) 


SNR 


100 


10 


10 


1.00 10" 1 54.02 


3.95 


3.00- lO" 1 


54.00 


3.95 


3.7 


250 


10 


25 


0.63 10" 1 53.60 


3.92 


1.90- 10" 1 


54.15 


3.96 


3.7 


250 


25 


10 


0.43 10 _1 54.61 


3.99 


1.90- 10" 1 


54.15 


3.96 


3.7 
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Table 4 

Results for small-n-large-p experiment under Grouped Laplacian sampling. Results based 
on 100 replications and AICc selected A. The true model has its parameters sampled 
according to the Grouped Laplacian scheme (see Section 4.1.2). The inclusion of 
grouping structure improves the predictive performance of the models whether the 
predictors are clustered in the correct number of groups (1.0K) or not (0.5K and 1.5K). 
LASSO selects a smaller number of variables and is slightly sparser in terms of number 
of groups. iCAP estimates are more parsimonious in terms of degrees of freedom 







LASSO 


0.5K 


1.0K 


1.5K 








Model 


errors 




V 


= 100, g = 10 


5.028 


3.783 


2.839 


3.481 






(0.208) 


(0.172) 


(0.119) 


(0.132) 


V 


= 250, 9 = 10 


13.061 


11.135 


6.660 


8.128 






(0.506) 


(0.834) 


(0.227) 


(0.271) 


V 


= 250, g = 25 


8.356 


5.113 


3.479 


4.457 






(0.379) 


(0.228) 


(0.149) 


(0.202) 








Number of selected variables 




p 


= 100, q = 10 


19.590 


93.940 


82.200 


73.720 






(0.546) 


(1.108) 


(1.508) 


(1.438) 


p 


= 250, g = 10 


26.070 


211.090 


169.700 


144.740 






(0.668) 


(3.335) 


(2.866) 


(2.781) 


n 


— 250 a — 25 


25.450 


236.300 


218.500 


192.730 






(0.664) 


(2.979) 


(3.049) 


(3.667) 








Number of true 


groups selected 


V 


= 100, q = 10 


8.140 


9.520 


8.220 


8.240 






(0.163) 


(0.090) 


(0.151) 


(0.148) 


V 


= 250, g = 10 


14.550 


21.780 


16.970 


16.720 






(0.336) 


(0.290) 


(0.287) 


(0.317) 


p 


= 250, g = 25 


7.980 


9.560 


8.740 


8.530 






(0.160) 


(0.102) 


(0.122) 


(0.149) 








Degrees of freedom 




V 


= 100, 9 = 10 


19.590 


13.600 


10.460 


12.060 






(0.546) 


(0.459) 


(0.365) 


(0.404) 


p 


= 250, 9 = 10 


26.070 


19.710 


18.490 


20.190 






(0.668) 


(0.646) 


(0.431) 


(0.501) 


V 


= 250, 9 = 25 


25.450 


18.010 


13.590 


14.920 






(0.664) 


(0.604) 


(0.481) 


(0.491) 


d 


The data is j 
= 10 resultinj 


generated according to (12) and set the number of variables to 
nna regression involving 10 main effects and 45 interactions: 



X — [Zi, . . . , Zio, Z1Z2, . . . , Z\Z\o, . . . , ZqZiq], 
Z/-~-AA(0,l). 
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Table 5 

Results for small-n-large-p experiment under Individual Laplacian sampling. Results based 
on 100 replications and AICc selected X. The true model has its parameters sampled 
according to the Independent Laplacian scheme (see Section 4.1.2). For a 
lower-dimensional model (p = 100, q — 10), the predictive performance of iCAP is 
comparable to the LASSO. For higher dimensions (p = 250, q — 10 and q = 25), iCAP 
has better predictive performance. LASSO selects a smaller number of variables than 
iCAP and a comparable number of groups in all cases. iCAP estimates are still more 
parsimonious in terms of degrees of freedom 



LASSO 



0.5K 



1.0K 



1.5K 



p = 100, g = 10 
p = 250, g = 10 
p = 250, q = 25 

p = 100, g = 10 
p = 250, q = 10 
p = 250, 9 = 25 

p = 100, g = 10 



250, 



10 



p = 250, q = 25 

p = 100, 9 = 10 
p = 250, 9 = 10 
p = 250, 9 = 25 



10.310 

(0.309) 

22.560 

(0.701) 

19.891 

(0.614) 



20.200 

(0.620) 

25.540 

(0.620) 

24.440 

(0.589) 

9.580 
(0.106) 
21.210 
(0.309) 

9.720 
(0.057) 

20.200 

(0.620) 

25.540 

(0.620) 

24.440 

(0.589) 



Model errors 



10.885 

(0.388) 

18.790 

(0.446) 

17.483 

(0.424) 



10.153 

(0.300) 

18.194 

(0.463) 

16.544 

(0.387) 



Number of selected variables 



96.460 

(0.869) 
228.070 

(2.246) 
243.490 

(1.530) 



90.700 

(1.249) 
180.700 

(3.003) 
234.250 

(1.935) 



Number of true groups selected 



9.760 
(0.092) 
24.110 
(0.115) 

9.870 
(0.049) 



9.480 
(0.098) 
21.580 
(0.266) 

9.730 
(0.066) 



Degrees of freedom 



16.320 

(0.684) 

22.230 

(0.630) 

20.360 

(0.610) 



16.140 
(0.635) 
20.290 
(0.486) 

20.060 
(0.635) 



11.056 

(0.348) 

18.990 

(0.422) 

18.301 

(0.493) 

73.530 

(1.459) 
150.000 

(2.508) 
198.040 

(2.638) 

9.320 
(0.119) 
20.720 
(0.290) 

9.650 
(0.069) 

14.650 
(0.614) 

21.210 
(0.536) 
18.670 
(0.637) 



We assume the hierarchical structure is given that the second order terms 
are to be added to the model only after their corresponding main effects. 
This is an extension of the hierarchical structure in Figure 4 from d = 4 to 
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Table 6 

Simulation setup for the ANOVA experiment 



Coefficients 



Description 


Z\Z-2 




Z\Za 


ZiZ% 


Z^Z^ 


Z%Z/± 


a 


SNR 


No interactions 




















3.7 


4.02 


Weak interactions 


0.5 








0.1 


0.1 





3.7 


4.08 


Moderate interactions 


1.0 








0.5 


0.4 


0.1 


3.7 


4.33 


Strong interactions 


5 








4 


2 





3.7 


13.88 


Very strong interactions 


7 


7 


7 


2 


2 


1 


3.7 


38.20 



d = 10. Applying (9) to this graph with uniform a m weights gives: 

io j'-i 

rG9)=EE[iAji + ii(A,^,Aj)ii 7( J- 

j=l i=l 

In this case, each interaction term is penalized in three factors of the sum- 
mation, which agrees to the number of variables that are added to the model 
(Zij, Zi and Z 3 ). 

We set the first four main effect coefficients to be (3\ = 7, fii = 2, (3$ = 1 
and /?4 = 1 with all remaining main effects set to zero. The main effects are 
kept fixed throughout and five different levels of interaction strengths are 
considered as shown in Table 6. The variance of the noise term was kept 
constant across the experiments. The number of observations n was set to 
121. The results reported in Table 7 were obtained from 100 replications. 

For low and moderate interactions, the introduction of hierarchical struc- 
ture reduced the model error from the LASSO. For strong interactions, the 
CAP and LASSO results were comparable. For very strong interactions, the 
implicit assumption of smaller second order effects embedded in the hierar- 
chical selection is no longer suitable causing the LASSO to reach a better 
predictive performance. 

In addition, CAP selected models involving on average a slightly lesser 
or equal number of variables to the LASSO in all simulated cases. The 
hierarchical gap (see definition in Section 4.2) for the LASSO shows that it 
did not comply with the hierarchy of the problem. According to the theory 
developed in Section 2, for CAP estimates this difference should be exactly 
zero. The small deviations from zero observed in our experiments are due 
to the approximate nature of the BLasso algorithm. 

4.2.2. Multiresolution model. In this experiment, the true signal is given 
by a linear combination of Haar wavelets at different resolution levels. Let- 
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Table 7 

Simulation results for the hierarchical ANOVA example. Results based on 50 replications, 
121 observations and 10-fold CV. Hierarchical structure lead to reduced model error and 
sparser models. The hierarchy gap is the number of variables that must be added to the 
model so the hierarchy is satisfied. The LASSO does not respect the model hierarchy. The 
small deviations from zero for CAP estimates are due to BLASSO approximation 





LASSO 


"GLASSO" 


CAP(4) 


iCAP 






Model errors 






No interactions 


3.367 


1.481 


1.478 


1.466 




(0.288) 


(0.133) 


(0.134) 


(0.124) 


Weak interactions 


4.032 


2.190 


2.296 


2.117 




(0.303) 


(0.147) 


( f\ 1 CI \ 

(U.lOl ) 


(0.117) 


IvlOLieidLe illLei djCLlUIlo 


5.905 


4.260 


A 0Q0 
i.uyu 






(0.307) 


(0.205) 


(0.207) 


(0.196) 


Diiung iiiLui dciiuna 


8.912 


8.901 


i . i jo 


8 ^88 




(0.695) 


(0.621) 


(0.568) 


(0.626) 


Very strong int. 


11.474 


11.998 


1 A ^8 


9fi 079 
ZD.U 1 Zj 




(0.758) 


(0.800) 


(0.915) 


(1.351) 




Number of selected variables 






No interactions 


13.440 


12.720 


11.440 


9.280 




(0.897) 


(0.935) 


(0.725) 


(0.431) 


Weak interactions 


13.960 


13.780 


12.720 


9.920 




(0.936) 


(1.041) 


(0.806) 


(0.442) 


Moderate interactions 


16.160 


17.240 


14.960 


11.260 




(1.021) 


(1.015) 


(0.773) 


(0.488) 


Strong interactions 


21.800 


27.000 


20.680 


13.740 




(0.965) 


(0.910) 


(0.636) 


(0.384) 


Very strong int. 


26.400 


28.020 


20.520 


14.560 




(0.670) 


(0.577) 


(0.448) 


(0.289) 






Hierarchy gap 






No interactions 


3.560 


0.220 


0.420 


0.800 




(0.192) 


(0.066) 


(0.107) 


(0.128) 


Weak interactions 


3.560 


0.160 


0.340 


0.840 




(0.169) 


(0.052) 


(0.079) 


(0.112) 


Moderate interactions 


3.440 


0.420 


0.640 


1.020 




(0.174) 


(0.091) 


(0.106) 


(0.150) 


Strong interactions 


4.240 


0.740 


1.600 


2.020 




(0.184) 


(0.102) 


(0.143) 


(0.163) 


Very strong int. 


3.780 


1.080 


1.580 


0.920 




(0.155) 


(0.140) 


(0.143) 


(0.137) 



ting Z{j denote the Haar wavelet at the jth position of level i, we have 



Zij(t) 



-1, 



if t € 



j j + 1 



1, 
10, 



i + i i±2 

2«+i ' 2*+i 
otherwise, 



if t G 



for i £ N, j 



0,1,2,. ..,2 



i-l 
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case 5 {a = 47.43, SNR= 0.4) 
deep tree 



Fig. 7. Coefficients and hierarchies used in the wavelet tree example. 

In what follows, we let a vector Zij be formed by computing Zij(t) at 16 
equally spaced "time points" over [0, 1] . The design matrix X is then given 
by [Zoo, Z\o Zix, Z20 ••• Z23, Z30 ■•■ Z$t\- 

We consider five different settings for the value of (3 with various sparsity 
levels and tree depths. The parameter a is adjusted to keep the signal to 
noise ratio at 0.4 in all cases. The true model parameters are shown within 
the tree hierarchy in Figure 7. 

The simulated data corresponds to 5 sets of observations of the 16 "time 
points." Five-fold cross-validation was used for selecting the regularization 
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level. At each cross-validation round, the 16 points kept out of the sample 
correspond to each of the "time positions" ( "balanced cross-validation" ) . 

In its upper left panel, Figure 7 shows the directed graph representing 
the tree hierarchy used to form the CAP penalty using the recipe laid out 
by (9) in Section 2 with a m = 1 for all m. Our option for setting all weights 
to 1 in this case is similar to the one presented in the ANOVA experiment 
above: the number of variables added to the model when a variable is added 
matches the number of times its coefficient appears on the penalty. 

The results for the hierarchical selection are shown in Table 8. The use 
of the hierarchical information greatly reduced the model error as well as 
the number of selected variables. As in the ANOVA cases, the hierarchical 
gap shows that the LASSO models do not satisfy the tree hierarchy. The 
approximate nature of the BLASSO algorithm again causes the hierarchi- 
cal gap of GLASSO and CAP(4) to deviate slightly from zero. For hiCAP 
estimates, perfect agreement with the hierarchy is observed as the hiCAP 
algorithm (see Section 3) is exact. 

4.3. Additional experiments. In addition to the experiments presented 
above, we have also run CAP under examples taken from Yuan and Lin 
[23] and Zou and Hastie [26]. The results are similar to the ones obtained 
above: CAP results in improved prediction performance over the LASSO 
with models involving a larger number of variables, a similar or smaller 
number of groups and making use of less degrees of freedom. We invite the 
reader to check the details in the technical report version of this paper [25]. 

5. Discussion and concluding remarks. In this paper, we have introduced 
the Composite Absolute Penalty (CAP) family. It provides a regularization 
framework for incorporating predetermined grouping and hierarchical struc- 
tures among the predictors by combining L 7 -norm penalties and applying 
them to properly defined groups of coefficients. 

The definition of the groups to which norms are applied is instrumental in 
determining the properties of CAP estimates. Nonoverlapping groups give 
rise to group selection as done previously by Yuan and Lin [23] and Kim, 
Kim and Kim [14] where L\ and L2 norms are combined. CAP penalties 
extend these works by letting norms other than the L2-norm to be applied 
to the groups of variables. Combinations of the L\ and are convenient 
from a computational standpoint as illustrated by the iCAP (nonoverlapping 
groups) with fast homotopy/LARS-type algorithms. Its Matlab code can be 
downloaded from our research group website http:/ /www.stat. berkeley.edu/ 
twiki /Research /YuGroup /Software . 

The definition of CAP penalties also generalizes previous work by allowing 
the groups to overlap. Here, we have shown how to construct overlapping 
groups leading to hierarchical selection. Combinations of the L\ and are 
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Table 8 

Simulation results for the hierarchical wavelet tree example. Results based on 200 
replications, 5 x 16 observations and 5 fold "balanced" CV. Hierarchical structure lead to 
reduced model error and sparser models. The hierarchy gap is the number of variables 
that must be added to the model so the hierarchy is satisfied. The LASSO does not 
respect the tree hierarchy. Small discrepancies in GLASSO and CAP(4) due to 
approximation in BLASSO 





LASSO 


GLASSO 


CAP (4) 


iCAP 






Model errors 






Root-only tree 


40.508 


26.994 


28.498 


28.909 




(2.039) 


(1.635) 


(1.650) 


(1.820) 


One-sided 


80.197 


57.100 


58.228 


57.195 




(2.954) 


(2.519) 


(2.465) 


(2.532) 


Complete tree 


112.578 


76.911 


79.498 


78.117 




(3.407) 


(2.661) 


(2.736) 


(2.733) 


Regular tree 


82.979 


58.020 


60.037 


60.259 




(2.738) 


(2.237) 


(2.252) 


(2.379) 


Heavy-leaved tree 


454.607 


388.262 


385.770 


359.154 




(11.950) 


(10.015) 


(9.884) 


(10.565) 




Number of selected variables 






Root-only tree 


4.070 


4.495 


4.405 


3.185 




(0.209) 


(0.265) 


(0.241) 


(0.244) 


One-sided 


6.080 


6.770 


6.235 


5.415 




(0.211) 


(0.247) 


(0.218) 


(0.245) 


Complete tree 


7.010 


7.605 


6.995 


6.720 




(0.226) 


(0.227) 


(0.213) 


(0.248) 


Regular tree 


6.140 


6.690 


6.255 


5.490 




(0.230) 


(0.243) 


(0.218) 


(0.250) 


Heavy-leaved tree 


10.985 


11.630 


10.930 


11.240 




(0.258) 


(0.192) 


(0.186) 


(0.239) 






Hierarchy gap 






Root-only tree 


1.765 


0.235 


0.360 


0.000 




(0.096) 


(0.038) 


(0.049) 


(0.000) 


One-sided 


0.710 


0.180 


0.300 


0.000 




(0.064) 


(0.031) 


(0.038) 


(0.000) 


Complete tree 


1.640 


0.310 


0.505 


0.000 




(0.091) 


(0.042) 


(0.057) 


(0.000) 


Regular tree 


1.210 


0.225 


0.335 


0.000 




(0.069) 


(0.030) 


(0.039) 


(0.000) 


Heavy-leaved tree 


1.455 


0.210 


0.370 


0.000 




(0.079) 


(0.034) 


(0.044) 


(0.000) 
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also computationally convenient to hierarchical selection as illustrated by 
the hiCAP algorithm for tree hierarchical selection (also available from our 
research group website). 

In a set of simulated examples, we have shown that CAP estimates us- 
ing a given grouping or hierarchical structure can reduce the model error 
when compared to LASSO estimates. In the grouped-selection case, we show 
such reduction has taken place in the "small-n-large-p" setting and was ob- 
served even when the groups were data-determined (noisy) and the resulting 
number of groups was within a large margin of the actual number of groups 
among the predictors (between 50% and 150% of the true number of groups). 

In addition, iCAP predictions are more parsimonious in terms of use of 
degrees of freedom being less sensitive to disturbances in the observed data 
[7]. Finally, CAP's ability to select models respecting the group and hier- 
archical structure of the problems makes its estimates more interpretable. 
It is a topic of our future research to explore ways to estimate group and 
hierarchical structures completely based on data. 

APPENDIX A: PROOFS 
PROOF of Theorem 1 . Algebraically, we have for 7 > 1 

As a result, if 02 > 0, 0\ is locally not penalized at and it will only stay 
at this point if the gradient of the loss function L is exactly zero for 0\ = 0. 
Unless the distribution of the gradient of the loss function has an atom at 
zero for 0\, 0\ 7^ with probability one. □ 

PROOF of Theorem 2. It is enough to prove that T is convex: 

1. T(a • 0) = \a\ ■ T(/9), for all a € M: For each group k, N k (a/3) = aN k {(3). 
Thus T(a0) = ||N(a/?)|| To = \a\ \\N(0) || 70 = \a\T(0); 

2. T(Pi + 2 ) < T(/?i) + T(0 2 ): Using the triangular inequality, 

T(j3i + h) = + /W < E(Wi) + W2)) 70 

k k 

= ||JV(fr) + N(0 2 )\\ 7O < ||iV(/?i)|| 70 + \\N(Jh)U 
= T(/3 1 )+T(0 2 ). 

Convexity follows by setting 1 = 90 3 and (3 2 = (1 - 0)04 with 9 € [0, 1]. □ 
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A.l. DF Estimates for iLASSO and iCAP. We now derive an unbiased 
estimate for the degrees of freedom of iLASSO fits along the regularization 
path. The optimization problem defining the iLASSO estimate is dual to 
the LASSO problem (see [2]). Facts 1 through 3 below follow from this 
duality and the results in Efron et al. [8] and Zou, Hastie and Tibshirani 
[27]. For the remainder of this section, we denote the iLASSO and iCAP fit 
by fi(X,y)=XP(X,y). 

Fact A.l. For each X, there exists a set 1C\ such that K,\ is the union 
of a finite collection of hyperplanes and for all Y E Ca = M n — 1C\, X is not 
a breakpoint in the regularization path. 

Fact A. 2. /3(A, y) is a continuous function of y for all X. 

Fact A. 3. // y £ C\, then the sets 1Z\ and U\ are locally invariant. 

From these three facts, we can prove: 

Lemma A.l. For a fixed A > and Y gC\ , fi(X,y) satisfies 

\\jl(X,y + Ay) - fi(X,y)\\ < \\Ay\\, for sufficiently small Ay 

and 

V-jl(X,y) = \Ux\ + l. 

Proof. We first notice that, fi(X, y) = X(3(X, Y) = [Xn x X Ux ) ■ [a fyj. 
From the optimality conditions for the penalty, 

(X'X)a(X, Y) = X'Y - X ■ sign(Y - Xa(X, Y)) 

and 

(X'X)a(X, Y + AY) = X'(Y + AY) - X ■ sign(y + AY - Xa{X, Y + Ay)). 

For Y G C\, there exists small Ay so the signs of the correlation be- 
tween the residuals and each predictor are preserved. Subtracting the two 
equations above: £(A, Y + AY) - fi(X, Y) = X (X' X)" 1 X' AY . Thus, /}(A, Y) 
behaves locally as a projection on a fixed subspace given by 1Z\ and IA\. From 
standard projection matrix results: \\X(3{X,y + Ay) — X(3(X,y)\\ < ||Ay||, for 
small Ay and V ■ fi{X,Y) =ti(X{X'Xy 1 X') = \U X \ + 1. □ 

Lemma A.l implies that the fit £i(X,y) is uniformly Lipschitz on M n (it is 
the closure of C\). Using Stein's lemma and the divergent expression above: 
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Theorem A.l. The L^-penalized fit jj,\{y) is uniformly Lipschitz for 
all A. The degrees of freedom of fi\(y) is given by df(X) = E[\U\\] + 1. 

The proof for the case of nonoverlapping groups follows the same steps. 
We only present a sketch of the proof as a detailed proof is not very insight- 
ful. Fact A.l is proven by noticing that, for fixed A, each of the conditions 
defining breakpoints require Y to belong to a finite union of hyperplanes. 
Fact A. 2 follows from the CAP objective function being convex and con- 
tinuous in both A and Y. Fact A. 3 is established by noticing that the sets 
Ax and 1Zk,\,Vk = 1, . . . , K are invariant in between breakpoints. As before, 
the CAP fit behaves (except for a shrinkage factor) as a projection onto a 
subspace whose dimension is the number of "free" parameters at that point 
of the path. The result follows from standard arguments for linear estimates. 

APPENDIX B: PSEUDO-CODE FOR THE ICAP ALGORITHM 

1. Set t = 0, A t = maxfc||cfc(0)||, /3(At) = 0. 

2. Repeat until At = 0: 

(a) Set A\ t to contain all groups with c k = At ; 

(b) For each group k, set: 

U x , k = {j£G k :X>(Y-Xp(\ t )) = 0} and Tl x , k = G k -U kX , 

(c) Determine a direction A/3 such that: 

(i) if k i Ax, then A/? Sfc = 0; 

(ii) for k G A\, A/3-R. fc x = a k ■ Sx, k with a k chosen so: 

c k 0(\) + 5-Ap) 

= c k * (/3(A) +5-A0) for all k, k* G Ax and 
X^ x (X-X0(X)+8-A$)) 

= for small enough 5 > 0. 

(d) Compute the step sizes for which breakpoints occur: 

5 a = inf {c fe * 0x + 5- A/3) = c k ((3 x + S ■ A/3) for some k* 4 Ax and k G .4a}, 

<5>0 

5/ = mi{\\Pg k (X) + 5 ■ A/3IU = for some k G „4 A }, 
<5>0 

6u = inf {X' m (Y - X0(X) + 5 ■ A/3)) = for some m G £ fc with k G ^4 A }, 

<5>0 

6 R = inf {|/3 m (A) + 5 ■ A/3 m | = ||/% fc (A) + 5 ■ A^loo 
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for some m G Uk,\ with k € -4a}, 

5 S = inf {X' (Y - X($(X) + 5 ■ A/3)) = for some m G Q k with k 4. A x }, 
<5>0 

where we take the infimum over an empty set to be +oo. 

(e) Set t = t + 1, 6 = mm{5A,Si,5 R ,5u,5s,Xt}, h+i = ||cjfc0(A t ) + 8 ■ 
A/9) He and p(\ t+1 ) = /3(A t ) + S ■ A/3. 
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